FOREWORD


This report was written in an attempt to give the scientist,
engineer or laboratory manager who has a limited background in the area
of statistics a better understanding of the methods of correlation and
regression. A number of examples have been given to illustrate the
methods that have been developed, together with numerous graphical
representations to give the reader a pictorial description of many
interlocking relationships. No prior knowledge of statistics is assumed,
but some experience in mathematics beyond calculus is necessary to fully
comprehend the theoretical development.

No attempt has been made in this article to discuss such topics
as confidence intervals or tests of hypotheses involving the correlation
coefficient or regression equation. While these topics are certainly
important statistical concepts, it was considered desirable to restrict
the scope of the text to interpretations of the principles of correlation
and regression.

Over the past year, the author has been a statistical consultant
to the Office for Laboratory Management in the Office of the
Director of Defense Research and Engineering. He was motivated to
prepare this review as a preface to a series of studies utilizing these
statistical techniques that will be published in the near future.




CONTENTS

Foreword

1. Introduction

2. Regression in Two Variables

3. Correlation in Two Variables

4. Describing the Data

5. Spearman's Rank Correlation Coefficient

6. Multiple Linear Regression

7. The Correlation Matrix

8. Multiple Correlation Coefficients

9. Geometrical Considerations

10. Partial Correlation Coefficients

Bibliography




1. INTRODUCTION


The methods of regression have been used quite extensively in
recent years in the prediction of certain random phenomena called variables.
Frequently it is difficult to obtain observations on one variable, or
observations can be obtained only after a considerable time delay. In such
cases it is often desirable to establish a relationship between this variable
and one or more other variables which can more easily be observed so that the
former variable can be predicted. For example, a student's college point
average is a variable which can be observed only after the completion of four
years of work, while the student's high-school grades and scores on various
college entrance examinations are available prior to his enrollment into
college. Many educators have been interested in predicting college success
from such precollege records. Another major concern in our society today is
that of predicting a man's salary from age, educational background and type
of work data. The personnel offices of large corporations must know what
factors influence salaries so that they can remain competitive with other
organizations.

In regression the variable which is being predicted is often referred
to as the dependent variable, and the variables used in the predicting are
referred to as the independent variables. The initial step in an investigation
requires that observations be made on both the dependent and independent
variables. Using this data a relationship is established which best describes
the trend of the observations between the dependent and independent variables
so that in the future it is only necessary to observe the independent variables
to make a reliable prediction as to what the value of the dependent variable
would be if observed. For example, in a college admissions office, a study
might be made of current college seniors using point average as the dependent
variable and the student's high school grades and scores on college entrance
examinations as the independent variables so that only students with a high
predicted point average would be admitted the following year. The
relationship that is established using the observations from the dependent and
independent variables will be called the regression equation, and the mechanics
of establishing this relationship will be discussed in the text of this article.

It should be pointed out that in regression one is not necessarily
seeking perfect prediction of the dependent variable. Often the dependent
variable will be strongly related to one or possibly two variables and only
weakly related to a number of others. This is exhibited in the salary problem,
where age and educational level are strong factors, while such things as
personality traits and compatibility with fellow employees, though
sometimes important, have little effect for most employees and are difficult
to measure in terms of salary effect. In regression, therefore, one is only
seeking a relationship involving those variables to which the dependent
variable is strongly related. No interpretation of the term "strongly related"
will be given here, since the interpretation may depend upon the particular
investigation being conducted.

Having obtained a regression equation, it is often desirable to obtain
a measure of the strength of the hypothesized relationship, where the term
strength will be taken to mean the degree to which the data follows the
regression equation. Several correlation techniques have, therefore, been
developed to describe the strength of the regression equation, and
interpretations will be given as to the meaning of the results.

As a matter of notation, capital letters (such as $Y$, $X$, $X_1$) will be
used to designate variables and small letters (such as $y_i$, $x_i$, $x_{1i}$)
used to designate particular observations from the variables. Small letters
will also be used in writing the regression equations.

2. REGRESSION IN TWO VARIABLES

Suppose an investigator is interested in studying the relationship
between salary and age of the professional employees at Company A. The
investigator initially believes that a relationship exists because he
conjectures that an older man, in general, will have more experience in his
profession than his younger colleague, and therefore will be more valuable to
his employer. It is realized by the investigator that age is by no means the
only factor involved, but it is conjectured that at least some trend will be
present. A plot is therefore made of the observations on the variable Y
(salary in dollars/month) versus the observations on the variable X (age
in years) for the professional employees of the company and presented in
the upper graph of Figure 1 along with the data. A casual observation shows
that there appears to be an increasing trend of salary with age, but the
investigator wishes to make more than just a subjective appraisal of the
data. A further examination indicates that the trend appears to be linear,
so the investigator wishes to find an equation of the form

$$y = a + bx \tag{1}$$

which describes the trend of the observations from the variables X and Y.
That is, for this set of data, determine constants a and b so that an
employee's salary in dollars per month can be predicted by multiplying his
age in years by the constant b and adding the constant a. It should be
emphasized that the quantity $a + bx$ only serves as a predicted salary for an
employee of age x years based upon a trend determined by the entire
professional population and that a particular individual's salary might differ
considerably from its predicted value, owing to the influence of other factors
besides age (such as educational background and personality traits) which
affect an employee's salary.

The investigator is thus interested in determining the constants in
Equation (1) so that this equation best represents the trend established by
Company A. The phrase "best represents" could be interpreted quite
differently by various investigators, so it is desirable to introduce the
standard technique of fitting a line to the data known as the method of least
squares; that is, find constants a and b in Equation (1) such that the sum

$$S = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \tag{2}$$


is minimized, where the sum is taken over all pairs $(x_i, y_i)$ representing
the age and salary pair of the ith employee. It will be recalled that
$a + b x_i$ will represent the predicted salary for the ith employee, so the
quantity

$$d_i = y_i - (a + b x_i)$$


represents the ith employee's salary deviation from the linear trend of the
professional population of Company A. From Equation (2) the method of least
squares is seen to find constants a and b to make the sum of the squares of
these deviations as small as possible. The details will not be stated here,
but it can be shown that the constants a and b satisfying the least squares
criterion can be found for Equation (2) by obtaining the partial derivatives
$\partial S/\partial a$ and $\partial S/\partial b$ and then solving
simultaneously the two equations

$$\frac{\partial S}{\partial a} = 0, \qquad \frac{\partial S}{\partial b} = 0$$

for a and b. The result will yield the values

$$b = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}, \qquad a = \bar{y} - b \bar{x}, \tag{3}$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and
$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$.


When a and b are calculated from (3), Equation (1) is known
as the linear regression of Y on X. From (3) it is seen that the constant a
is determined directly from b and the two means. Thus, a desirable
alternative form of the linear regression can be obtained by the direct
substitution for the constant a into Equation (1), yielding the modified form

$$y = \bar{y} + b (x - \bar{x}). \tag{3a}$$

This form is desired by many because it explicitly exhibits the means of the
data used to obtain the regression equation.

Using the data given in Figure 1, the linear regression has been
calculated and this line plotted through the data in the second graph of
Figure 1. Geometrically, the deviations $d_i$ represent the vertical distance
from the regression line to the point $(x_i, y_i)$ as illustrated in the figure.
The regression line has the properties that

$$\sum_{i=1}^{n} d_i = 0$$

and that

$$\sum_{i=1}^{n} d_i^2$$

is minimized. It should be mentioned that, if X is considered to be the
dependent variable instead of Y, then the regression of X on Y cannot be
found by solving the equation $y = a + bx$ for x. Rather, one wishes to find
constants $a'$ and $b'$ in the equation $x = a' + b'y$ such that the sum of
squares of the horizontal distances from the points $(x_i, y_i)$ to the line
$x = a' + b'y$ is minimized. This solution for $a'$ and $b'$ may be
considerably different from the constants found by the inversion of the
equation $y = a + bx$, especially if the sample size is small.
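The closed form of Equation (3) can be sketched in a few lines of Python. The ages and salaries below are hypothetical values chosen for illustration (they are not the data of Figure 1), and the function name `fit_line` is likewise an assumption of this sketch. The example also checks that the deviations sum to zero and that the regression of X on Y is not simply the inverse of the regression of Y on X.

```python
def fit_line(xs, ys):
    """Return (a, b) for y = a + b*x using the closed form of Equation (3)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (
        sum(x * x for x in xs) - n * x_bar ** 2
    )
    a = y_bar - b * x_bar
    return a, b


# Hypothetical ages (years) and salaries (dollars/month); not the Figure 1 data.
ages = [25, 30, 35, 40, 45, 50]
salaries = [800, 900, 950, 1100, 1150, 1300]

# Regression of Y on X and the vertical deviations d_i from the fitted line.
a, b = fit_line(ages, salaries)
deviations = [y - (a + b * x) for x, y in zip(ages, salaries)]

# Regression of X on Y, obtained by swapping the roles of the two variables.
a_prime, b_prime = fit_line(salaries, ages)

print(f"Y on X: y = {a:.2f} + {b:.4f} x")
print(f"sum of deviations = {sum(deviations):.6f}")
print(f"slope from inverting Y on X = {1 / b:.5f}")
print(f"slope of X on Y             = {b_prime:.5f}")
```

For any data that do not lie exactly on a line, the two printed slopes differ, which is the point made in the closing sentences of this section.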


3.  CORRELATION  IN  TWO  VARIABLES 


Having obtained a regression equation, the investigator is often
interested in determining the strength of this relationship obtained by his
regression technique. One of the most frequently used measures of this fit
is the Pearson product moment correlation coefficient obtained from the
observations on the variables X and Y by the equation

$$r = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n \bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n \bar{y}^2\right)}} \tag{4}$$

Some  of  the  major  properties  of  this  correlation  coefficient  are  as  follows: 


(a) For any set of points $-1 \le r \le +1$.

(b) r = +1 if the points lie on a straight line with positive slope.

(c) r = -1 if the points lie on a straight line with negative slope.

(d) r will be close to zero if the data has no linear trend.

(e) If each $x_i$ is multiplied by a positive constant, the value of
r is unchanged.

(f) If a constant is added to each $x_i$, the value of r is unchanged.

Property (a) gives the range of the correlation coefficient, while properties
(b), (c), and (d) give special significance to three particular values. In
addition, a positive correlation less than one indicates that the slope b of
the regression line is positive but that the data points do not fall on a
straight line. Similarly, a negative correlation indicates the slope of the
regression line is negative. The relationship between the slope of the
regression line and the correlation coefficient is exhibited more explicitly
by the equation

$$b = r \sqrt{\frac{\sum_{i=1}^{n} y_i^2 - n \bar{y}^2}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}}$$

obtained from Equations (3) and (4). To further illustrate these properties,
six hypothetical sets of data have been given in Figure 1.1 together with
their corresponding correlation coefficients. Properties (e) and (f) indicate
that certain linear transformations of the $x_i$'s do not alter the value of
the correlation coefficient. Such transformations are often very helpful in
simplifying computations, especially if the correlation is calculated by
hand or with a desk calculator. Property (e) also implies that the value
of the correlation coefficient does not depend upon the units of the
$x_i$'s. That is, if $x_i$ is a length, the same correlation will be obtained
whether the $x_i$'s are recorded in inches, feet, centimeters or miles.
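Properties (c), (e) and (f) can be checked numerically with a direct transcription of Equation (4). This is only a sketch: the lengths and weights are hypothetical values, and the function name `pearson_r` is an assumption of the example rather than anything defined in the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product moment correlation, written as in Equation (4)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical lengths (inches) and weights; illustration only.
lengths_in = [10.0, 12.0, 15.0, 19.0, 24.0]
weights = [3.1, 3.9, 4.2, 5.8, 6.5]

r = pearson_r(lengths_in, weights)

# Property (e): a change of units (inches to centimeters) leaves r unchanged.
r_cm = pearson_r([2.54 * x for x in lengths_in], weights)

# Property (f): adding a constant to every x leaves r unchanged.
r_shifted = pearson_r([x + 100.0 for x in lengths_in], weights)

# Property (c): points on a straight line with negative slope give r = -1.
r_line = pearson_r([1, 2, 3], [5, 3, 1])

print(f"r = {r:.4f}, in cm = {r_cm:.4f}, shifted = {r_shifted:.4f}")
print(f"r for points on a falling line = {r_line:.4f}")
```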


To give a specific meaning for all values $-1 \le r \le +1$ it can be
shown by squaring both sides of Equation (4) that

$$r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - a - b x_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{5}$$


where a and b are from Equation (3). Noting that $b = 0$, $a = \bar{y}$ is a
solution which makes the sum of squares in the numerator equal to the sum of
squares in the denominator, the solution given by (3) must of necessity give
a sum of squares in the numerator which is no larger than the denominator,
since this solution was obtained so as to minimize the numerator. The square
of the correlation coefficient $r^2$ thus has range $0 \le r^2 \le 1$ and is
often referred to as the coefficient of determination. Using

$$d_i' = y_i - \bar{y}$$

to represent the deviation of the ith observation from the mean $\bar{y}$, it
can be seen from Equation (5) and Figure 2 that $r^2$ can be interpreted as the
fractional reduction in the variation of the variable Y when the sum of
squares is measured from the regression line of Y on X instead of from
the mean $\bar{y}$. Another way of expressing this would be to say that the
linear regression of Y on X explains $100 r^2$ percent of the variation of the
variable Y.

In the salary versus age data given in Figures 1-2 the correlation
coefficient was calculated from Equation (4) and found to be 0.768. From
this, the coefficient of determination was calculated and found to be 0.590.
Thus, for Company A, the linear dependence of salary on age explains
59.0% of the variation in salary.
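The equivalence between squaring Equation (4) and the sum-of-squares form of Equation (5) can be checked numerically. The sketch below uses hypothetical data and deviation-from-mean sums, which are algebraically the same as the sums in Equations (3) and (4); the function name `determination` is an assumption of the example. The last line simply confirms the arithmetic quoted for Company A.

```python
def determination(xs, ys):
    """Return r^2 computed two ways: from Equation (5) and from Equation (4)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)

    # Least squares constants, equivalent to Equation (3).
    b = s_xy / s_xx
    a = y_bar - b * x_bar

    # Equation (5): one minus (sum of squares about the line) / (about the mean).
    about_line = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    from_eq5 = 1 - about_line / s_yy

    # Square of Equation (4).
    from_eq4 = s_xy ** 2 / (s_xx * s_yy)
    return from_eq5, from_eq4


# Hypothetical ages and salaries; illustration only.
ages = [25, 30, 35, 40, 45, 50]
salaries = [800, 900, 950, 1100, 1150, 1300]

r2_eq5, r2_eq4 = determination(ages, salaries)
print(f"r^2 from Equation (5): {r2_eq5:.6f}")
print(f"r^2 from Equation (4): {r2_eq4:.6f}")

# The Company A figures quoted in the text: 0.768 squared, to three decimals.
company_a = round(0.768 ** 2, 3)
print(f"0.768 squared = {company_a}")
```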


Another  useful  measure  of  the  strength  of  the  regression  equation 
is  the  standard  error  of  estimate  Se  defined  by  the  equation 


(y  -a  -  bx  ) 
i  i 


2 


(6) 


where a and b are defined by (3). This measure is based upon the square
of the deviations of the points from the regression line and therefore
measures the spread of the points about the regression line. The quantity
$S_e$ is equal to zero if all points lie on a straight line and is positive
otherwise, with larger values of $S_e$ indicating weaker linear trends.


4.  DESCRIBING  THE  DATA 

The quantities b, r and $S_e$ are all useful in describing
characteristics of the data, but care must be taken to avoid misinterpretations.
The correlation coefficient r is greatly dependent upon the slope of the
regression line b, so that two sets of data which look quite different on a
graph may have nearly identical correlation coefficients. To more clearly
exhibit this property, salary versus age data has been gathered on the
professional employees of companies A, B, and C in Figure 3, and a graph
of this data presented in Figure 4. Below the data in Figure 3, and also
in Figure 4, a number of statistics have been calculated in an attempt to
adequately summarize the data from the three companies. A comparison
between companies A and B indicates that if the standard error of estimate,
$S_e$, is held fairly constant, an increase in the slope b will cause an
increase in the correlation coefficient r. Note that companies B and C have
similar slopes, but in this case an increase in the standard error of
estimate, $S_e$, produces a corresponding decrease in the correlation
coefficient, r, for Company B. A comparison between A and C shows that these
two effects can be neutralized. That is, both companies have similar
correlations r, but company A has a much larger standard error of estimate,
$S_e$, and also larger slope b. This last comparison demonstrates that two
sets of data can have similar correlation coefficients but exhibit different
linear trends. This results because the correlation coefficient as shown by
Equation (5) measures only a relative reduction in the sum of squares of the
variable Y due to the linear regression of Y on X.

Figure 3

In summary, the three quantities $S_e$, b and r individually give the
investigator only limited information about the data, but together they
provide a good description of the linear relationship between the dependent
and independent variables. That is, b gives the slope of the regression
line, $S_e$ a measure of the spread about the regression line and $r^2$ the
relative reduction in the sum of squares due to the linear regression.
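A small sketch shows how two sets of data can share the same correlation while differing in slope and spread. Everything here is hypothetical: the data, the function name `describe`, and in particular the divisor n used for the standard error of estimate, which is an assumption of this example (some texts divide by n - 2 instead).

```python
from math import sqrt


def describe(xs, ys):
    """Return slope b, standard error of estimate S_e and r^2 for y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    residual_ss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    # ASSUMPTION: divisor n; the choice of divisor is not taken from the text.
    s_e = sqrt(residual_ss / n)
    return b, s_e, 1 - residual_ss / s_yy


# Hypothetical salaries at two firms for the same ages; illustration only.
ages = [25, 30, 35, 40, 45, 50]
firm_one = [800, 900, 950, 1100, 1150, 1300]
# At the second firm every salary increment above 800 is halved.
firm_two = [800 + 0.5 * (y - 800) for y in firm_one]

b1, se1, r2_1 = describe(ages, firm_one)
b2, se2, r2_2 = describe(ages, firm_two)

print(f"firm one: b = {b1:.3f}, S_e = {se1:.3f}, r^2 = {r2_1:.4f}")
print(f"firm two: b = {b2:.3f}, S_e = {se2:.3f}, r^2 = {r2_2:.4f}")
```

Halving every increment halves both the slope and the spread about the line, so the relative reduction in the sum of squares, and hence the correlation, is the same for both firms.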

Another interesting aspect of the data in Figures 3-4 is that the
predicted salary for a man of age 30, which can be found by substituting the
value x = 30 into Equation (3a) for each company, is about the same for
all three companies, as shown in the table below.

Predicted Salary at Age 30 in Dollars/Month

Company A    858
Company B    847
Company C    861

Thus the starting salary for a young professional might well
be the same at all three companies, but because of the advancement policy
of the individual companies, the yearly increases might vary drastically
from company to company.

Finally, it should be mentioned that when one uses the correlation
and regression techniques, special care should be taken to eliminate all
errors from the data. The method of least squares is especially sensitive
to extraneous points, and even a few bad points among a few hundred can
drastically alter the results. As an illustration, let us suppose that the
salary of the last individual in Company C was recorded erroneously as
$755/month in Figure 3. Figure 4.1 shows a plot of the regression line
with this bad point included in the data in addition to the recalculation of
the regression equation after this point has been eliminated. The effect of
this one bad point is surprising unless one is familiar with the mathematical
analysis. It is, therefore, often advisable to plot the data on a graph and
check the validity of any point that does not appear to follow the trend
established by the majority of the data points.

Figure 5
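The sensitivity of least squares to a single bad point can be illustrated with a short sketch. The data are hypothetical (they are not the Company C data of Figure 3), and the recording error is simulated by replacing the last salary with a misplaced-decimal value.

```python
def slope(xs, ys):
    """Least squares slope b of y on x, equivalent to Equation (3)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


# Hypothetical ages and salaries (dollars/month); illustration only.
ages = [25, 30, 35, 40, 45, 50, 55, 60]
salaries = [850, 900, 960, 1010, 1050, 1110, 1160, 1200]

# The same data with the last salary recorded erroneously as 120.
bad_salaries = salaries[:-1] + [120]

b_clean = slope(ages, salaries)
b_bad = slope(ages, bad_salaries)

print(f"slope with correct data:  {b_clean:.3f} dollars/month per year")
print(f"slope with one bad point: {b_bad:.3f} dollars/month per year")
```

In this example one erroneous value among eight is enough to reverse the sign of the fitted slope, which is why plotting the data before fitting is advisable.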

5. SPEARMAN'S RANK CORRELATION COEFFICIENT

Another measure of the correlation between two variables that
can be used is Spearman's rank correlation coefficient. To obtain this
measure for a set of data defined by the pairs $(x_i, y_i)$, the coordinates
must be ranked with respect to the two variables; thus, the smallest $x_i$
is given rank one, the second smallest rank two, and continuing until
the largest $x_i$ is given rank n. Define, therefore, $R_{x_i}$ to be the rank
of the coordinate $x_i$ among the n observations on the variable X and
similarly define the rank $R_{y_i}$. Using then the data $(R_{x_i}, R_{y_i})$
in Equation (4), the results give Spearman's rank correlation coefficient.
However, since the data now consists only of integer values, Equation (4) can
be simplified in this special case to give the more common equation for the
calculation of the rank correlation $r^*$ as follows:

$$r^* = 1 - \frac{6 \sum_{i=1}^{n} (R_{x_i} - R_{y_i})^2}{n(n^2 - 1)} \tag{7}$$

An example of the calculation of the rank correlation coefficient
is given in Figure 4 using the salary versus age data of Company A. The
rank correlation has also been calculated for Companies B and C using
Equation (7), and a comparison with the product moment correlation is given
in the table below for the three companies.

for this data using Equation (4) and found to be r = 0.7 8. The data
points are now ranked with respect to both their x and y coordinate values
and plotted in the lower graph in Figure 6. This ranking procedure is
essentially a transformation of the original data which preserves the rank
of the observations but which distorts the spread of the data to give
observations which are exactly one unit apart in both the horizontal and
vertical directions. This transformation is introduced to simplify
calculations, but, as shown in the example of Figure 6, this simplification is
often produced at the expense of considerable distortion in the relative
spread of the data. Using the ranked data, the rank correlation coefficient
can be calculated from Equation (4) or from the simplified form given
by Equation (7) and found to be $r^* = 0.9061$. A comparison indicates that
the transformation to ranked data produces a change of more than 0.11 in
the correlation coefficient in this case.

In general, if the original data is evenly spread with respect to
the two variables, then r and $r^*$ will be very close to each other, since
little distortion is produced by the transformation to ranked data. If,
however, a plot of the original data indicates that the points appear in
clusters, it is entirely possible that r and $r^*$ will be considerably
different.
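The ranking procedure and Equation (7) can be sketched as follows. The data are hypothetical and deliberately clustered, with four points close together and one far away, and the ranking helper assumes there are no tied values. The sketch also checks that Equation (7) agrees with Equation (4) applied to the ranks.

```python
from math import sqrt


def ranks(values):
    """Rank 1 for the smallest value up to n for the largest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson_r(xs, ys):
    """Pearson product moment correlation, written as in Equation (4)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    return s_xy / sqrt(s_xx * s_yy)


def spearman_r(xs, ys):
    """Spearman's rank correlation from the simplified form of Equation (7)."""
    n = len(xs)
    squared_diffs = sum((p - q) ** 2 for p, q in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Hypothetical clustered data: four nearby points and one distant point.
xs = [1, 2, 3, 4, 20]
ys = [2, 1, 4, 3, 30]

r = pearson_r(xs, ys)
r_star = spearman_r(xs, ys)
r_on_ranks = pearson_r(ranks(xs), ranks(ys))

print(f"product moment r       = {r:.4f}")
print(f"rank correlation r*    = {r_star:.4f}")
print(f"Equation (4) on ranks  = {r_on_ranks:.4f}")
```

Here the distant point dominates the product moment correlation, while the ranks place it only one unit beyond its neighbor, so the two measures differ noticeably.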

6. MULTIPLE LINEAR REGRESSION

Suppose the investigator feels the dependent variable Y is related
to two or more independent variables and therefore wishes to use a function
of several variables to predict the dependent variable Y. For k independent

Figure 7
Data Table

which upon simplification will yield the system

For large values of k this system can best be solved by matrix methods, but
if k is small, a solution can be obtained by substitution or by Cramer's Rule.

An alternative form often used for the linear regression is obtained by
solving the first equation of the system (9) for $a_0$ and making a direct
substitution into Equation (8). This yields the equation

$$y = \bar{y} + a_1 (x_1 - \bar{x}_1) + \cdots + a_k (x_k - \bar{x}_k). \tag{10}$$


Again, this form is often preferable to Equation (8) because it explicitly
exhibits the means of the variables used to obtain the regression equation.

It should be noted that the variables $X_j$ need not be independent. In
fact, quite the contrary; it is possible for one variable to be a nonlinear
function of some other variable. For example, it is possible to have
$X_2 = X_1^2$ or $X_3 = X_1 + 3X_2^2$. However, linear combinations of
variables, such as $X_3 = X_1 + 2X_2$, where the exponents on the independent
variables are all one, are not permitted. In this case, the matrix of the
system of Equations (9) is singular and the inverse does not exist; or, in
other words, the coefficients $a_i$ are not uniquely determined when any of
the variables is a linear combination of some of the other variables.
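The effect of a linear combination among the independent variables can be seen in a small sketch. It assumes the usual least squares setup, in which the prediction equation is $y = a_0 + a_1 x_1 + \cdots + a_k x_k$ and the coefficients are found from the normal equations; the function names, the elimination routine and the data are all hypothetical choices made for this illustration.

```python
def normal_equations(columns, ys):
    """Build the least squares system for y = a0 + a1*x1 + ... + ak*xk."""
    design = [[1.0] * len(ys)] + [list(col) for col in columns]
    matrix = [[sum(p * q for p, q in zip(u, v)) for v in design] for u in design]
    rhs = [sum(p * y for p, y in zip(u, ys)) for u in design]
    return matrix, rhs


def solve(matrix, rhs, tol=1e-9):
    """Gaussian elimination with partial pivoting; ValueError if singular."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < tol:
            raise ValueError("system is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


# Hypothetical data generated exactly from y = 1 + 2*x1 - 0.5*x2.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
ys = [1 + 2 * p - 0.5 * q for p, q in zip(x1, x2)]

coefficients = solve(*normal_equations([x1, x2], ys))
print("a0, a1, a2 =", [round(c, 6) for c in coefficients])

# Adding x3 = x1 + 2*x2, a linear combination, makes the system singular.
x3 = [p + 2 * q for p, q in zip(x1, x2)]
try:
    solve(*normal_equations([x1, x2, x3], ys))
    is_singular = False
except ValueError:
    is_singular = True
print("singular with x3 included:", is_singular)
```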


Consider now an example employing the concept of multiple regression.
Suppose an investigator wishes to predict the yield of a crop from rainfall and
temperature data given in Figure 7.

Assuming the prediction equation has the form given by Equation (8)
with k = 2, the system of Equations (9) reduces to the following three
equations:

$$\begin{aligned}
10 a_0 + 87.9 a_1 + 710.3 a_2 &= 66.3 \\
87.9 a_0 + 843.73 a_1 + 6179.08 a_2 &= 600.44 \\
710.3 a_0 + 6179.08 a_1 + 50537.19 a_2 &= 4692.08
\end{aligned}$$

Obtaining a simultaneous solution to this system for the coefficients
$a_0$, $a_1$, $a_2$ and then substituting into Equation (10) yields the
regression equation

$$y = 6.63 + 0.21413 (x_1 - 8.79) - 0.0382 (x_2 - 71.03). \tag{11}$$

In a system like that which was just considered, where there is
more than one independent variable, Equation (4) cannot be used directly
to calculate the strength of the relationship given by Equation (11).
Therefore, the methods of correlation will be generalized to handle this
situation.


7.  THE  CORRELATION  MATRIX 


Suppose a problem is considered for which there is one dependent
variable Y and two independent variables $X_1$ and $X_2$ as in the preceding
example. Define $r_{YX_1}$ to be the Pearson product moment correlation
coefficient calculated from the observations on the variables Y and $X_1$
using Equation (4) while ignoring the observations on the variable $X_2$.
That is, calculate the correlation coefficient $r_{YX_1}$ between Y and $X_1$
as if the observations on $X_2$ had never been taken. Similarly, define the
product moment correlation between all other pairs of the three variables,
and define $r_{YY}$ to be the correlation of the variable Y with itself while
ignoring the observations from the other two variables. The correlation
matrix is then defined to be a listing of the product moment correlation for
all pairs of variables presented in the form


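$$\begin{pmatrix}
r_{YY} & r_{YX_1} & r_{YX_2} \\
r_{X_1Y} & r_{X_1X_1} & r_{X_1X_2} \\
r_{X_2Y} & r_{X_2X_1} & r_{X_2X_2}
\end{pmatrix}$$

Since it can be shown from Equation (4) by symmetry that

$$r_{XY} = r_{YX}$$

and that

$$r_{XX} = 1,$$

it is customary to present only the upper half of the matrix in the form

$$\begin{pmatrix}
1 & r_{YX_1} & r_{YX_2} \\
  & 1 & r_{X_1X_2} \\
  &   & 1
\end{pmatrix} \tag{12}$$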


Using the data given in Figure 7 and the form given by (12) the
following correlation matrix is obtained:

$$\begin{pmatrix}
1 & +0.9831 & -0.8781 \\
  & 1 & -0.8312 \\
  &   & 1
\end{pmatrix} \tag{13}$$


This matrix indicates that there is a strong direct relationship
between yield and rainfall and a strong inverse relationship between yield
and temperature as well as a strong inverse relationship between the
independent variables, rainfall and temperature. The main purpose of the
correlation matrix is to present the correlation between all pairs of variables
in a standard form so that the interlocking relationships may be observed.

The concept of the correlation matrix may be generalized to any number
of independent variables.
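A correlation matrix for any number of variables can be assembled by applying Equation (4) to every pair. The sketch below uses hypothetical yield, rainfall and temperature readings (not the data of Figure 7) and prints only the upper half, following the form (12); the function names are assumptions of the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product moment correlation, written as in Equation (4)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    return s_xy / sqrt(s_xx * s_yy)


def correlation_matrix(variables):
    """Return the full symmetric matrix of pairwise correlations."""
    return [[pearson_r(u, v) for v in variables] for u in variables]


# Hypothetical readings; illustration only.
crop_yield = [5.1, 6.0, 6.8, 7.4, 8.3]
rainfall = [6.0, 8.0, 9.0, 10.0, 12.0]
temperature = [78.0, 74.0, 73.0, 69.0, 66.0]

matrix = correlation_matrix([crop_yield, rainfall, temperature])
labels = ["Y", "X1", "X2"]

for row_index, row in enumerate(matrix):
    # Blank out entries below the diagonal so only the upper half is shown.
    cells = [
        "        " if col < row_index else f"{value:+8.4f}"
        for col, value in enumerate(row)
    ]
    print(labels[row_index].ljust(3), "".join(cells))
```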


As another illustration of how closely the rank correlations
approximate the product moment correlations, the quantities $R_{y_i}$,
$R_{x_{1i}}$, $R_{x_{2i}}$ were calculated as shown in Figure 7. Using
Equation (7), the rank correlations were calculated for all pairs of variables
and presented in the following rank correlation matrix:

$$\begin{pmatrix}
1 & +0.9879 & -0.8546 \\
  & 1 & -0.8424 \\
  &   & 1
\end{pmatrix}$$


This example shows that with only ten points, a close approximation is
obtained to the product moment correlations using rank correlation methods.
It should be remembered that there is a fairly even spread of the data in
this case, however, and one cannot expect the approximation to be as good
if the data points are clustered.




8. MULTIPLE CORRELATION COEFFICIENTS


An examination of the correlation matrix (13) in view of Equation (5)
indicates that in this example $100 \times (0.9831)^2\% = 96.6\%$ of the
variation in the yield can be explained by the linear regression of Y on
$X_1$. However, suppose the investigator feels that temperature also has an
effect on yield and decides to use both temperature and rainfall data to
obtain a prediction equation for yield. From the data given in Figure 7 and
using the form of Equation (8), he calculates the regression of Y on $X_1$
and $X_2$ and obtains Equation (11). It now becomes important to him to obtain
a measure of the strength of this prediction equation to determine how much
improvement in the prediction procedure has resulted from the use of both
variables in the regression equation. An examination of the correlation
matrix (13) indicates that yield and temperature are indeed related, but,
since rainfall and temperature are also related, there remain some questions
as to whether both variables are needed in predicting yield.


Therefore, generalizing Equation (4) to two independent variables, define the square of the multiple correlation coefficient $R_{Y(X_1X_2)}$ by the equation

$$R^2_{Y(X_1X_2)} = 1 - \frac{\sum_{i=1}^{n} (y_i - \alpha_0 - \alpha_1 x_{1i} - \alpha_2 x_{2i})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (14)$$

where $\alpha_0$, $\alpha_1$, $\alpha_2$ are obtained by the solution of the system (9) with $k = 2$. The quantity $R^2_{Y(X_1X_2)}$ thus gives the fractional reduction in the variation of the variable $Y$ explained by the linear regression of $Y$ on $X_1$ and $X_2$. Some properties of the multiple correlation coefficient are as follows:


(a) For any set of data points, $0 \le R_{Y(X_1X_2)} \le 1$.

(b) If $R_{Y(X_1X_2)}$ is close to one, then the linear regression of $Y$ on $X_1$ and $X_2$ is a good prediction equation for $Y$.

(c) If $R_{Y(X_1X_2)}$ is small, then the linear regression of $Y$ on $X_1$ and $X_2$ is not a good prediction equation for $Y$.
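The definition of the squared multiple correlation coefficient for two independent variables can be sketched as follows. The data are hypothetical (not the Figure 7 values) and the helper names are invented for this illustration. The constants are found from the two normal equations written in terms of deviations from the means, which is assumed here to be equivalent to solving the system (9) with k = 2.

```python
# Hypothetical data for illustration only (not the report's Figure 7 values).
rain = [6.1, 7.4, 8.0, 8.9, 9.5, 10.2, 11.0, 7.0, 9.9, 8.4]          # X1
temp = [76.0, 74.5, 72.1, 71.0, 69.8, 68.5, 66.9, 73.2, 70.4, 72.8]  # X2
yld = [5.1, 5.6, 6.5, 6.8, 7.3, 7.6, 8.4, 5.7, 7.2, 6.4]             # Y

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum over i of (u_i - mean(u)) * (v_i - mean(v))."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def fit_two_predictors(x1, x2, y):
    """Least-squares constants a0, a1, a2 for y ~ a0 + a1*x1 + a2*x2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    # det is zero when x1 and x2 are perfectly collinear or one is constant.
    det = s11 * s22 - s12 ** 2
    a1 = (s1y * s22 - s2y * s12) / det
    a2 = (s2y * s11 - s1y * s12) / det
    a0 = mean(y) - a1 * mean(x1) - a2 * mean(x2)
    return a0, a1, a2

def r_squared(x1, x2, y):
    """1 - (residual sum of squares) / (total sum of squares about the mean)."""
    a0, a1, a2 = fit_two_predictors(x1, x2, y)
    ss_res = sum((c - a0 - a1 * a - a2 * b) ** 2 for a, b, c in zip(x1, x2, y))
    return 1 - ss_res / cross(y, y)

print("R squared:", round(r_squared(rain, temp, yld), 4))
```

A value near one corresponds to property (b) and a value near zero to property (c).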


Calculating now the multiple correlation coefficient for the regression Equation (11) using the data in Figure 7 together with Equation (14), it was found that

$$R_{Y(X_1X_2)} = 0.9890 \qquad (15)$$


This result indicates that $100 \times (0.9890)^2\,\% = 97.8\%$ of the variation in yield can be explained by the linear regression of $Y$ on $X_1$ and $X_2$. Previously, it was shown that 96.6% of the variation in yield could be explained by the regression of $Y$ on $X_1$, so that a subtraction indicates that an additional 1.2% of the variation in yield can be explained with the addition of the variable $X_2$ to the prediction equation.
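The percentages quoted above follow directly from the two reported coefficients, as this short check shows. The variable names are chosen here for illustration only.

```python
r_y_x1 = 0.9831    # correlation of yield with rainfall, from matrix (13)
R_y_x1x2 = 0.9890  # multiple correlation coefficient, Equation (15)

pct_x1 = 100 * r_y_x1 ** 2      # variation explained by X1 alone
pct_both = 100 * R_y_x1x2 ** 2  # variation explained by X1 and X2
gain = pct_both - pct_x1        # additional variation explained by X2

print(f"{pct_x1:.1f}% explained by X1 alone")    # 96.6%
print(f"{pct_both:.1f}% explained by X1 and X2")  # 97.8%
print(f"{gain:.1f}% gained by adding X2")         # 1.2%
```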

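Equation (4) can be further generalized to $k$ variables where the square of the multiple correlation coefficient $R_{Y(X_1X_2 \ldots X_k)}$ is defined by the equation

$$R^2_{Y(X_1X_2 \ldots X_k)} = 1 - \frac{\sum_{i=1}^{n} (y_i - \alpha_0 - \alpha_1 x_{1i} - \cdots - \alpha_k x_{ki})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (16)$$

where $\alpha_0 \ldots \alpha_k$ are obtained by the solution of the system (9).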


9. GEOMETRICAL CONSIDERATIONS

The geometrical representation of the regression equation of $Y$ on $X_1 \ldots X_k$ is that of a k-dimensional hyperplane in (k+1)-dimensional Euclidean space which is best in the least squares sense. This geometrical representation, although mathematically interesting, has little practical use.

There is, however, another geometrical approach which is very useful in exhibiting some important characteristics of the data. Consider the example given in Figure 7, and define $(x_{1i}, x_{2i}, y_i)$ to be the rainfall, temperature and yield triple for the ith year. Using the ten triples, the regression of $Y$ on $X_1$ and $X_2$ was calculated for this data and given in Equation (11).

Using the triple $(x_{1i}, x_{2i}, y_i)$, define $z_i$ to be the predicted yield for the ith year obtained by substituting the values of the independent variables $X_1$ and $X_2$ into the regression equation (11) for the ith year, yielding the equation

$$z_i = 6.63 + 0.21413\,(x_{1i} - 8.7?) - 0.0382\,(x_{2i} - 71.03) \qquad (17)$$


A plot of $Z$ versus $Y$ for the ten years is given in Figure 8 together with a tabulation of $z_i$ using Equation (17). This graph demonstrates the ability of the regression equation to predict the dependent variable. That is, if the dependent variable $Y$ is strongly related to the linear regression of $Y$ on $X_1$ and $X_2$, then $z_i$ will be a good prediction of $y_i$ and the points $(z_i, y_i)$ will be close to the line $z = y$. If, on the other hand, $Y$ is only weakly related to the regression of $Y$ on $X_1$ and $X_2$, then the plot will exhibit a greater spread about the line $z = y$. Figure 8 also can be used to divide the YZ plane into two regions separated by the line $z = y$. Any point in the region above the line $z = y$ has an observed value $y_i$ smaller than predicted by the regression equation, while all those points below the line $z = y$ have an observed $y_i$ larger than predicted. Therefore, any unusual point can be detected merely by noting those points with the largest deviations from the line $z = y$.
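The screening for unusual points described above can be sketched as a simple ranking of the deviations between observed and predicted values. The numbers below are hypothetical and the function name is invented here. "Above the line" follows the text's convention that a point above z = y has a predicted value larger than the observed one.

```python
# Hypothetical observed (y) and predicted (z) yields for ten years;
# these are NOT the values tabulated in the report's Figure 8.
observed = [5.1, 5.6, 6.5, 6.8, 7.3, 7.6, 8.4, 5.7, 7.2, 6.4]
predicted = [5.0, 5.8, 6.3, 6.9, 7.2, 7.7, 8.3, 5.6, 7.9, 6.2]

def rank_deviations(y, z):
    """Return (year, y - z, note) tuples sorted by |y - z|, largest first."""
    rows = []
    for year, (yi, zi) in enumerate(zip(y, z), start=1):
        if zi > yi:
            note = "above z = y: observed smaller than predicted"
        elif zi < yi:
            note = "below z = y: observed larger than predicted"
        else:
            note = "on the line z = y"
        rows.append((year, yi - zi, note))
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)

# The first few rows are the candidates for "unusual" points.
for year, deviation, note in rank_deviations(observed, predicted)[:3]:
    print(year, round(deviation, 2), note)
```

With these made-up numbers the ninth year stands out, since its observed yield falls well short of the predicted value.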

Another interesting result as a consequence of Figure 8 comes to light if the product moment correlation coefficient $r_{YZ}$ is calculated between the variables $Y$ and $Z$. It should be remembered that the constants $\alpha_0$, $\alpha_1$, $\alpha_2$ were originally chosen so as to minimize the sum

$$\sum_{i=1}^{n} (y_i - \alpha_0 - \alpha_1 x_{1i} - \alpha_2 x_{2i})^2 = \sum_{i=1}^{n} (y_i - z_i)^2 \qquad (18)$$

Thus, for the calculation of $r_{YZ}$, the solution for the constants $a$ and $b$ from Equation (5) is $a = 0$, $b = 1$ and because of the relationship given by Equation (18)

$$r_{YZ} = R_{Y(X_1X_2)}$$

In general, these results state that the multiple correlation coefficient $R_{Y(X_1 \ldots X_k)}$ can be interpreted as the product moment correlation $r_{YZ}$ between the dependent variable $Y$ and the variable $Z$, the linear combination of the independent variables $X_1 \ldots X_k$ given by the regression equation.
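The interpretation of the multiple correlation coefficient as the correlation between Y and Z can be checked numerically. The sketch below uses hypothetical data (not the Figure 7 values) and helper names invented for this illustration. It fits the two-variable regression in the centered form, forms the predictions, and compares the two quantities.

```python
from math import sqrt

# Hypothetical data for illustration only (not the report's Figure 7 values).
rain = [6.1, 7.4, 8.0, 8.9, 9.5, 10.2, 11.0, 7.0, 9.9, 8.4]          # X1
temp = [76.0, 74.5, 72.1, 71.0, 69.8, 68.5, 66.9, 73.2, 70.4, 72.8]  # X2
yld = [5.1, 5.6, 6.5, 6.8, 7.3, 7.6, 8.4, 5.7, 7.2, 6.4]             # Y

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum over i of (u_i - mean(u)) * (v_i - mean(v))."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def pearson(u, v):
    """Product moment correlation coefficient."""
    return cross(u, v) / sqrt(cross(u, u) * cross(v, v))

def predictions(x1, x2, y):
    """Least-squares predictions z_i from the regression of y on x1 and x2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    a1 = (s1y * s22 - s2y * s12) / det
    a2 = (s2y * s11 - s1y * s12) / det
    m1, m2, my = mean(x1), mean(x2), mean(y)
    return [my + a1 * (a - m1) + a2 * (b - m2) for a, b in zip(x1, x2)]

z = predictions(rain, temp, yld)
ss_res = sum((yi - zi) ** 2 for yi, zi in zip(yld, z))
R = sqrt(1 - ss_res / cross(yld, yld))  # multiple correlation coefficient
r_yz = pearson(yld, z)                  # correlation between Y and Z

print(round(R, 6), round(r_yz, 6))  # the two values agree up to rounding error
```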
10. PARTIAL CORRELATION COEFFICIENTS


When the relationship between two variables is being determined, quite frequently the true relationship between these two variables is disguised by a common relationship to a third variable. For example, in the problem just considered, it was found that the correlation between yield and temperature was -0.8781. An examination of the correlation matrix (13) shows that both temperature and yield are strongly related to the third variable, which is rainfall in this case. One might therefore ask the question, "Is the relationship between yield and temperature as strong as indicated by the correlation coefficient, or is this correlation coefficient strengthened by the strong dependence of both variables upon rainfall?"


It does not seem unreasonable to consider the possibility that high rainfall as a cause produces an effect of high yield and low temperature, and that, in reality, temperature only appears to affect yield because of its dependence upon rainfall. To examine the true dependence of yield upon temperature, one would like to eliminate the effect of rainfall in examining the variation in yield and temperature. It is not immediately obvious how this can be accomplished, because rainfall is not a controllable variable; however, one acceptable way of eliminating the effect of rainfall will now be considered.

Using Equation (3), calculate the regression of $Y$ on $X_1$ and use the form of the regression equation given by Equation (3a). Define

$$\Delta y_i = y_i - \bar{y} - b\,(x_{1i} - \bar{x}_1) \qquad (21)$$

$\Delta y_i$ is the difference between the actual yield and that predicted by the regression equation of $Y$ on $X_1$. Similarly, define $\Delta x_{2i}$ to be the difference between the actual average temperature and the temperature predicted by the regression of $X_2$ on $X_1$. The quantity $\Delta x_{2i}$ is given by the equation

$$\Delta x_{2i} = x_{2i} - \bar{x}_2 - b'\,(x_{1i} - \bar{x}_1) \qquad (22)$$

$\Delta y_i$ and $\Delta x_{2i}$ designate the variation in yield and temperature, respectively, which cannot be explained by the linear regression on rainfall. These quantities are calculated for this example and are presented in Figure 9.

Calculating now the product moment correlation coefficient between $\Delta Y$ and $\Delta X_2$ given by Equation (4), one obtains

$$r_{\Delta Y \Delta X_2} = -0.59830 \qquad (23)$$

This quantity $r_{\Delta Y \Delta X_2}$ is known as the partial correlation of $Y$ and $X_2$ with the effect of the variable $X_1$ removed. A more common notation that is used in most texts for the partial correlation, as above, is $r_{YX_2 \cdot X_1}$, yielding the notational identity

$$r_{YX_2 \cdot X_1} = r_{\Delta Y \Delta X_2} \qquad (24)$$

It is not necessary to calculate the partial correlation coefficient in the way indicated in Figure 9. Rather, it can be calculated directly from the product moment correlation matrix (13). With a little patience, it can be shown from Equation (4) that

$$r_{\Delta Y \Delta X_2} = \frac{r_{YX_2} - r_{YX_1}\, r_{X_1X_2}}{\sqrt{(1 - r_{YX_1}^2)(1 - r_{X_1X_2}^2)}} \qquad (25)$$

Calculation of the partial correlation from Equation (25) is often preferable, since no information beyond the correlation matrix is required.

In the problem under consideration it was shown that 96.6% of the variation in yield was explained by the linear regression of yield on rainfall. Thus, by a subtraction from 100%, 3.4% of the variation in yield was not explained by its linear regression on rainfall. The square of the partial correlation coefficient $r_{YX_2 \cdot X_1}$ gives the fraction of the unexplained variation in yield which can be explained by the variation in temperature. That is, $100 \times (-0.5983)^2\,\% = 35.8\%$ of the variation in yield, unexplained by the linear regression of yield on rainfall, can be explained by temperature variation. Or equivalently, an additional $100 \times (0.034)(0.358)\,\% = 1.2\%$ of the variation in yield can be explained by temperature variation over and above what is explained by rainfall. Adding these results together, (96.6 + 1.2)% = 97.8% of the variation in yield can be explained by rainfall and temperature together. It will be remembered that this is exactly the same result that was obtained with the use of the multiple correlation coefficient given by Equation (15). Thus, the relationship which exists between the multiple correlation coefficient and the partial correlation coefficient can be expressed by the equation

$$R^2_{Y(X_1X_2)} = r^2_{YX_1} + (1 - r^2_{YX_1})\, r^2_{YX_2 \cdot X_1} \qquad (26)$$

Formulae have been developed here for the partial correlation coefficient for the elimination of the effect of one variable in considering the relationships that exist among a set of variables. The theory can be extended to the elimination of more than one variable, and equations for this process are given by Kendall (4).
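The two routes to the partial correlation coefficient can be compared in a short sketch. One route correlates the residuals of Y and X2 after regression on X1. The other uses only the three pairwise correlations, through the standard first-order partial correlation formula. The data are hypothetical (not the Figure 7 values) and the function names are invented here. The last lines apply the relationship between the multiple and partial correlation coefficients to the report's rounded values 0.9831 and -0.5983.

```python
from math import sqrt

# Hypothetical data for illustration only (not the report's Figure 7 values).
rain = [6.1, 7.4, 8.0, 8.9, 9.5, 10.2, 11.0, 7.0, 9.9, 8.4]          # X1
temp = [76.0, 74.5, 72.1, 71.0, 69.8, 68.5, 66.9, 73.2, 70.4, 72.8]  # X2
yld = [5.1, 5.6, 6.5, 6.8, 7.3, 7.6, 8.4, 5.7, 7.2, 6.4]             # Y

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum over i of (u_i - mean(u)) * (v_i - mean(v))."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def pearson(u, v):
    """Product moment correlation coefficient."""
    return cross(u, v) / sqrt(cross(u, u) * cross(v, v))

def residuals(u, x):
    """u_i - mean(u) - b*(x_i - mean(x)), with b the least-squares slope of u on x."""
    b = cross(u, x) / cross(x, x)
    mu, mx = mean(u), mean(x)
    return [ui - mu - b * (xi - mx) for ui, xi in zip(u, x)]

def partial_from_residuals(y, x2, x1):
    """Correlate what is left of y and x2 after removing the linear effect of x1."""
    return pearson(residuals(y, x1), residuals(x2, x1))

def partial_from_matrix(r_y1, r_y2, r_12):
    """First-order partial correlation from the three pairwise correlations."""
    return (r_y2 - r_y1 * r_12) / sqrt((1 - r_y1 ** 2) * (1 - r_12 ** 2))

p_resid = partial_from_residuals(yld, temp, rain)
p_matrix = partial_from_matrix(pearson(yld, rain), pearson(yld, temp), pearson(rain, temp))
print(round(p_resid, 6), round(p_matrix, 6))  # the two routes agree

# Report's rounded values: r_YX1 = 0.9831, partial correlation = -0.5983.
R2 = 0.9831 ** 2 + (1 - 0.9831 ** 2) * (-0.5983) ** 2
print(round(sqrt(R2), 3))  # 0.989, in line with the reported 0.9890
```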